Many authors have thought classical test theory was
invalid for criterion-referenced tests. The item (difficulty
and discrimination) and test (reliability and validity)
statistics in classical test theory are highly dependent upon
the calibration sample of individuals used. We may speak of
the estimates of item and test parameters in classical test
theory as valid within a range of interest along the
characteristic measured. It has generally been the case that this
range of interest is the distribution of the characteristic
in some population and the calibration sample used is intended
to be a random sample from that population. In such
populations, it is usually the case that the extremes are poorly
represented and the parameter estimates are relatively poor
at these extremes.

For criterion-referenced scales the range of interest is
defined by a range of the characteristic rather than the
distribution of that characteristic in some population. The
calibration sample must be representative of that range of
interest. When the range of interest is appropriately defined,
an appropriate calibration sample may be selected, and classical
test theory applies directly to criterion-referenced scales.



Classical Test Theory and Criterion-Referenced Scales



M. I. Chas. E. Woodson
University of California, Berkeley



A wide variety of definitions of "criterion-referenced test" have been
suggested (e.g., Glaser, 1963; Glaser & Nitko, 1971; Harris & Stewart, 1971;
Ivens, 1970; Kriewall, 1969; Popham & Husek, 1969; Hively, Patterson & Page,
1968). Common to these definitions is an emphasis on the interpretation of
test outcomes in terms of behavior. We shall take the position that
"criterion-referenced" is not a property of the test but of a scale for
interpreting the test (Woodson, 1973a), although it seems likely that the
kind of scale one has in mind using with a test will have an impact on test
construction procedures. Our definition of "criterion-referenced" is close
to that of Glaser and Nitko (1971): "A criterion-referenced test is one
that is deliberately constructed so as to yield measurements that are directly
interpretable in terms of specific performance standards." We would modify
this to refer to scales and, rather than limit ourselves to a cutoff score
associated with a standard, place individuals on a scale interpretable in
terms of behavior. Therefore, "a criterion-referenced scale is one that
yields measurements directly interpretable in terms of some specific
dimension of behavior." Note that it is not designed to most effectively rank
individuals within a population.
In our judgment, there has been an overemphasis on determining
whether an individual has exceeded a standard in order to stop instruction
on that objective. Instructors need to know where the individual is on
a dimension of learning. For example, Woodson (1973c) found the effectiveness
of instructional steps differed considerably at different degrees of
learning. If other studies find this, degree of learning will be a
significant parameter in instructional models.

It has been argued that classical test theory does not apply to
criterion-referenced tests (Popham & Husek, 1969) because, under some common
circumstances, criterion-referenced test items and the tests themselves are
likely to have no variance, and a lack of variance makes the common
statistics (item difficulty, item discrimination, test reliability and test
validity) invalid or undefined. Woodson (1974) has argued that this argument
is fallacious, as all items and tests must have variance within the range of
interest for which they are calibrated in order to provide any useful
information.

The above argument suggests that classical test theory may therefore be
relevant to criterion-referenced test and item analysis. The present paper
argues that this is the case.

In classical test theory, item and test parameters are estimated by
statistics from a calibration sample of individuals. For classical test
theory the calibration of a test or item must be done within a population
of testees with appropriate variability on the characteristic measured. The
distribution of the characteristic of interest in the population sampled,
and therefore the distribution of the characteristic in the sample, determines
in part the parameter estimates. These statistics are known to be
sensitive to restriction of the sample.
Item difficulty within a population, estimated by difficulty within
the sample, obviously depends upon the distribution of the characteristic
in the sample. To skillful individuals, an item is much easier than to
less skillful individuals.

Estimates of item discrimination within a population are also
sensitive to the characteristics of the calibration sample used. If the
calibration sample is restricted in some way, the estimates may be unreliable.

Test reliability within a population, the most commonly used parameter 
to evaluate a test, is known to be sensitive to the characteristics of the 
calibration sample. 
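
The three statistics just described can be made concrete with a small
sketch. The formulas chosen here are assumptions for illustration, since
the paper does not name particular estimators: difficulty is taken as the
proportion correct, discrimination as the correlation between an item and
the total on the remaining items, and reliability as Kuder-Richardson
formula 20. The response matrix is invented.

```python
from statistics import mean, pvariance

def item_difficulty(responses, item):
    """Proportion of the sample answering `item` correctly."""
    return mean(row[item] for row in responses)

def pearson(x, y):
    """Pearson correlation; None when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def item_discrimination(responses, item):
    """Correlation of the item with the total score on the other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)

def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores; None if no variance."""
    k = len(responses[0])
    var_total = pvariance([sum(row) for row in responses])
    if var_total == 0:
        return None
    pq = sum(p * (1 - p)
             for p in (item_difficulty(responses, i) for i in range(k)))
    return k / (k - 1) * (1 - pq / var_total)

# Invented 0/1 responses: rows are persons, columns are items.
full = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
restricted = full[:3]  # only the three most able persons

print(item_difficulty(full, 0), kr20(full))
print(item_discrimination(restricted, 0), kr20(restricted))
```

In this toy matrix, restricting the sample to the three most able persons
lowers KR-20 from 0.80 to about 0.44 and leaves the first item with no
variance, so its discrimination is undefined.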

These classical test theory statistics are referred to here as "within
the population" to emphasize that they are bound by the
population which the sample represents. In most cases random sampling from
the population is assumed, so the statistics apply for a specified population
of testees (e.g., 4th, 5th and 6th graders).

Another way of conceptualizing this situation is to refer to these
statistical estimates of the parameters involved as valid within a range
of interest. For norm-referenced scales, the calibration sample is a random
sample of a population, and the distribution of the characteristic in this
population defines the range of interest. Such a scale is norm-referenced
in that the scale is dependent upon the population represented by the
calibration sample for its meaning.

Criterion-referenced scales are scales whose meaning refers to the
characteristic measured rather than the distribution of the characteristic
in some population. It is therefore necessary to estimate item and test
parameters with statistics valid for the range of interest within which the
test will be used. This can be done by specifying the characteristics of
the population for which the test is to be calibrated.

In the case of criterion-referenced scales, the item or test
statistics also apply to a "range of interest," that is, a range of the
characteristic for which data are available and for which the item and test
are calibrated. In the norm-referenced test, this is specified as the range
of the characteristic in the population.

The same item and test statistics of classical test theory used for
norm-referenced scales apply to criterion-referenced scales, provided the
range of interest is appropriately specified. One way of specifying this
range of interest (W. E. Coffman, personal communication) is to include in
the calibration sample equal numbers of individuals who have received and
have not received relevant instruction. A more general procedure is to
choose a calibration sample which contains an adequate representation of
the range of the characteristic to be measured.
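
The equal-numbers suggestion can be sketched as a balanced draw from each
group. The function name, the record layout and the candidate pool below
are hypothetical. The same function would serve the more general procedure
if the grouping function returned a band of the characteristic rather than
an instructed/uninstructed flag.

```python
import random

def balanced_calibration_sample(candidates, group_of, per_group, seed=0):
    """Draw `per_group` candidates at random from every group.

    candidates: list of person records (any type)
    group_of:   function mapping a record to its group label
    """
    rng = random.Random(seed)
    groups = {}
    for person in candidates:
        groups.setdefault(group_of(person), []).append(person)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < per_group:
            raise ValueError(
                f"group {label!r} has only {len(members)} candidates")
        sample.extend(rng.sample(members, per_group))
    return sample

# Hypothetical pool: 10 instructed and 20 uninstructed candidates.
candidates = [{"id": i, "instructed": i % 3 == 0} for i in range(30)]
sample = balanced_calibration_sample(
    candidates, lambda person: person["instructed"], per_group=8)
print(len(sample), sum(person["instructed"] for person in sample))
```

Raising an error when a group is too small, rather than silently drawing
fewer people, keeps an under-represented part of the range of interest
from going unnoticed.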

Difficulty within the range of interest is therefore a relevant
characteristic for item analysis. Discrimination within the range of interest
is the most useful statistic for the selection of items. Test reliability and
validity also have the same meaning for criterion-referenced scales as they
do for norm-referenced scales.

Note, however, that no matter what type of scale is being used, if
the calibration sample is highly restricted, or not representative of the
range of interest, the item and test statistics are not valid estimates of
the parameters of interest. For example, if the ability of the calibration
sample is so high that the items of a particular test are trivially easy,
this restriction of the sample makes the statistics invalid for any range of
interest other than the one upon which the test was calibrated, and trivial
within that range because the item (or test) does not discriminate.

The empirical characteristics of items within a calibration sample
are used to select items for a test and thereby contribute to the
determination of what a test measures. If a norm-referenced approach is taken,
items which do not measure a characteristic which varies within the
calibration sample tend to be discarded, and items which vary greatly within
the calibration sample tend to be selected. For criterion-referenced tests,
the reference is not a population but a range of a dimension of behavior.

Examples 

Consider the problem of the development of a spelling test and related 
norm-referenced and criterion-referenced scales. The characteristic involved 
is spelling ability within the 500 most frequently misspelled English words. 

The norm-referenced approach to construct a 10 item test would be:

1. Select a sample (not necessarily randomly) of the items,

2. Administer the items to a calibration sample of individuals, randomly
sampled from the population which defines the range of interest within
which the item and test parameter estimates will be valid,

3. Compute item difficulties within the sample which are estimates of the
difficulty in the range of interest (about .5 is desirable),

4. Compute item discriminations within the sample (the higher the better),

5. Select the 10 items with the best discrimination estimates,

6. Norms are prepared for the population for which the test is designed
(7th, 8th, 9th graders),

7. Individual performance would be described in terms of how individuals
compared to the distribution of scores in the standardization sample
(e.g., rank order within 7th graders, or grade-equivalent scores),

8. The resulting scale may be referred to as spelling ability relative
to the distribution of abilities of a particular population of persons
on those items on which these persons differ most frequently.

The purpose of this scale is to discriminate among persons on spelling
ability; therefore items selected will tend to be ones on which persons
differ the most. In other words, differences among persons in the
calibration sample will contribute to the definition of what is measured.
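
Steps 4 through 7 of the norm-referenced procedure can be sketched as
follows. The item-rest correlation as the discrimination index and the
midpoint convention for percentile ranks are assumptions made for this
sketch, and the response matrix is invented (four items, two selected,
in place of the 10 item test).

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; None when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def select_items(responses, n_items):
    """Indices of the n_items items with the highest item-rest correlation."""
    scored = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        r = pearson(item, rest)
        if r is not None:  # an item with no variance cannot discriminate
            scored.append((r, i))
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:n_items])

def percentile_rank(score, norm_scores):
    """Percent of the norming sample below `score`, ties counted as half."""
    below = sum(s < score for s in norm_scores)
    ties = sum(s == score for s in norm_scores)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)

# Invented calibration data: rows are persons, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
chosen = select_items(responses, 2)
norms = [sum(row) for row in responses]
print(chosen, percentile_rank(3, norms))
```

The first item is answered correctly by everyone in this sample, so it is
dropped before ranking. This is the sense in which differences among
persons in the calibration sample help define what is measured.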

The criterion-referenced scale approach to construction of a 10 item
test would be:

1. Select a sample (not necessarily randomly) of the items.

2. Administer the items to a calibration sample. The calibration sample is
selected to be representative of the population of observations (range
of interest) for which the items and test are to be calibrated. If an
instructional program is being assessed, this would include appropriate
proportions of persons to represent every value of the characteristic
in question in the range of interest.

3. Compute item difficulties within the calibration sample (.5 would give
most effective measurement near the center of the range of interest;
other values are needed for the extremes).

4. Compute item discriminations within the calibration sample (the higher
the better).

5. Select the 10 items with the best discrimination estimates within the
calibration sample.

6. Scores of individuals are in terms of the selected items within the
range of interest. An individual score on this scale does not rank
him with respect to others, but places him on a scale defined by the
items.

7. The resulting scale may be referred to as spelling ability on the
500 most misspelled words.

This scale is not dependent upon the distribution of the characteristic
in a population.
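
One possible reading of step 6 is a domain score: the proportion of the
selected items spelled correctly, taken as an estimate of performance on
the 500-word domain. This reading is an assumption of the sketch, not a
procedure stated in the paper, and it is reasonable only to the extent
that the selected items represent the domain. The binomial standard error
shown makes the same assumption, and the function name is hypothetical.

```python
import math

def domain_estimate(item_scores, domain_size=500):
    """Interpret 0/1 item scores as an estimate of domain performance.

    Returns (proportion correct, binomial standard error of that
    proportion, estimated number of domain words spelled correctly).
    Assumes the items behave like a random sample from the domain.
    """
    n = len(item_scores)
    p = sum(item_scores) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, round(p * domain_size)

# Hypothetical examinee: 7 of the 10 selected words spelled correctly.
p, se, words = domain_estimate([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
print(f"proportion {p:.2f} (SE {se:.2f}), about {words} of 500 words")
```

No other examinee's score enters the computation, which is the contrast
with a rank or grade-equivalent score on a norm-referenced scale.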

Note that for the criterion-referenced scale, items are eliminated for
being inappropriate within the range of interest, which is not necessarily
the distribution of the characteristic in some standardization population.
It may well be a range at the extreme of some natural population of
individuals. It may also include observations on an individual at various
levels of learning.

In the limiting case of a very short range of interest, item discrimina- 
tion and test reliability go to zero. 
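
The short-range limit can be illustrated by simulation. Everything in this
sketch is an assumption made for illustration: responses are generated from
a simple logistic response model, reliability is estimated with KR-20, and
the ability ranges, item difficulties and sample sizes are invented.

```python
import math
import random
from statistics import pvariance

def simulate_responses(abilities, difficulties, rng):
    """0/1 responses under a logistic response model (assumed here)."""
    return [
        [1 if rng.random() < 1 / (1 + math.exp(-(theta - b))) else 0
         for b in difficulties]
        for theta in abilities
    ]

def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores."""
    k = len(responses[0])
    n = len(responses)
    var_total = pvariance([sum(row) for row in responses])
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    pq = sum(pi * (1 - pi) for pi in p)
    return k / (k - 1) * (1 - pq / var_total)

rng = random.Random(1974)
difficulties = [-2 + 4 * i / 19 for i in range(20)]       # 20 items
wide = [rng.uniform(-3, 3) for _ in range(2000)]          # broad range
narrow = [rng.uniform(-0.05, 0.05) for _ in range(2000)]  # very short range

rel_wide = kr20(simulate_responses(wide, difficulties, rng))
rel_narrow = kr20(simulate_responses(narrow, difficulties, rng))
print(f"KR-20, wide range of interest:   {rel_wide:.2f}")
print(f"KR-20, narrow range of interest: {rel_narrow:.2f}")
```

With almost no spread in the characteristic, differences in total score
are nearly all chance, so the reliability estimate falls to about zero
while the same items remain reliable over the broad range.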

In the limiting case of a very broad range of interest, observations may
be difficult to obtain to reliably estimate parameters. This is quite
reasonable; we cannot calibrate a test by classical test theory for a range
of a characteristic of which we have few or no instances.

In the limiting case where a population of individuals is randomly
sampled, we have, of course, the classical norm-referenced situation.

In short, the range of interest, and therefore the calibration sample,
in which a test is developed and calibrated defines the range of the
characteristic for which the test is useful.

This paper has taken the approach of using a calibration sample
representative of the range of interest of the characteristic and using
classical test theory to develop and evaluate a test. This is necessary
because the estimates of item and test parameters used in classical test
theory are sensitive to the calibration sample. Modern test theory may well
free us of the burden of sample-bound calibration. The two-parameter logistic
model yields sample-free calibration in theory (Rasch, 1966) and in practice
(Wright, 1968). There is also evidence (Woodson, 1973; Samejima, 1973)
that the three-parameter normal-ogive model gives relatively sample-free
calibration. Sample-free calibration may not require the specification of
a range of interest.

Pending fortunate developments in test theory, the developer of
criterion-referenced scales is best advised to select a calibration sample
representative of the range of interest of the characteristic to be measured,
and use the item statistics and test statistics of classical test theory,
bearing in mind that the estimates of the parameters obtained are valid only
for a particular range of the characteristic in question.